Analysis of KDM5 copy number and response to KDM5 did show a correlation. Here this relationship is visualized and investigated in a larger collection of samples. Additionally, the drug response is correlated with expression of KDM5.
Only the KDM5A and KDM5B paralogues are investigated. KDM5C and KDM5D are not included because they are located on the X and Y chromosomes respectively. Sex chromosomes are routinely excluded from CNV estimation because of their complex behavior.
Sensitivity to KDM5 inhibitor CPI-455 was determined for the tumoroids of the NETG1G2 cohort. Also for some cell lines this information may be available but needs to be retrieved first.
Copy number data is available for the samples of the NETG1G2 cohort, the samples from DiDomenico 2020 and several cell lines (BON1, QGP1, NT3)
Copy number analysis was not performed for all samples of DiDomenico 2020. While it is possible to add the information for the missing samples this is not done at the moment because the published CNV information was validated experimentally. Adding the missing samples may leave the impression that they were confirmed experimentally too.
# These functions are taken from 044_Cell_line_hypoxia
# They are used to extract the information about loci of interest from a saved conumee object
# Some arguments were static and now are function
# genes of interest
detail_regions = GRanges(seqnames = c("chr11", "chr12", "chr1", "chr1", "chr19", "chr9"),
ranges = IRanges(start = c(64570982, 389295, 202696526, 44115829, 4969125, 6720863),
end = c(64578766, 498620, 202778598, 44171186, 5153606, 7175648)),
strand = c("*"),
name = c("MEN1", "KDM5A", "KDM5B", "KDM4A", "KDM4B", "KDM4C"))
# The information about the cromosome arms is used later to generate heatmaps
hg19_arms = read_delim("/Bioinformatics/projects/shared_data/hg19_UCSC_cytoBand.txt.gz",
delim = "\t",
col_names = c("Chromosome", "Start", "End", "Band", "Giemsa"),
show_col_types = F)
hg19_arms = hg19_arms %>%
mutate(Arm = str_sub(Band, 1, 1)) %>%
group_by(Chromosome, Arm) %>%
summarize(Start = min(Start),
End = max(End),
Length = End - Start,
.groups = "drop")
createAnno = function(data_type){
data("exclude_regions")
# The array type is set to overlap because I'm using both EPIC and 450K data
CNV.create_anno(array_type = data_type,
exclude_regions = exclude_regions,
detail_regions = detail_regions)
}
extractCNABins = function(CNA_data){
# for each samples the bins are extracted and merged into one data frame
raw_bins = bind_rows(lapply(names(CNA_data), function(x){
bin_df = CNV.write(CNA_data[[x]], what = "bins")
colnames(bin_df)[5] = "CNV"
bin_df$Sample_Name = x
return(bin_df)
}))
# The bins are annotated with the chromosome arms
# Most likely there is a very elegant solution using GRanges and Rtracklayer
raw_bins = bind_rows(lapply(unique(raw_bins$Chromosome), function(x){
current_chr = raw_bins %>%
filter(Chromosome == x)
current_p = hg19_arms %>%
filter(Chromosome == x & Arm == "p")
current_chr %>%
mutate(Arm = ifelse(End < current_p$End, "p", "q"))
}))
}
getCNVvalues = function(gene_name,
CNV_bins,
anno){
gene_region = detail_regions[detail_regions$name == gene_name]
gene_bins = findOverlaps(gene_region, anno@bins, select = "all")
gene_bins = names(anno@bins[subjectHits(gene_bins)])
CNV_bins %>%
filter(Feature %in% gene_bins) %>%
group_by(Sample_Name) %>%
mutate(Start = min(Start),
End = max(End)) %>%
ungroup() %>%
pivot_wider(names_from = Feature,
values_from = CNV,
names_glue = "bin_{Feature}") %>%
mutate(average_CNV = rowMeans(across(starts_with("bin"), ~ .x))) %>%
dplyr::select(1:5,
average_CNV,
starts_with("bin"))
}
All expression data is taken from
046_RNASeq_count_data_collection. For all RNA-Seq data sets
the raw counts and normalized data is available.
Because data is compared across diverse data sets the
TMM normalization is chosen. This normalizes each sample by
the library size making the counts more comparable across experiments.
However, differences in library composition may systematically skew the
TMM values between experiments. Therefore, comparison
between experiments should be done carefully.
Wherever possible data mapped to GRCh38 is used.
030_NETG1G2_EPICFor the samples in this cohort drug response was measured on the tumoroids. DNA methylation data is only available for the primary tumors and the drug sensitivity information is transferred from the tumoroids. In turn tumoroids are assigned the same CNV as the matched tumors. Expression data is available for some tumors and tumoroids.
In addition, the DAXX / ATRX / MEN1 mutation status was investigated. All available samples are DAXX / ATRX mutated. For MEN1 no muation data is available for these samples. Therefore, no figures are shown.
# For each data set there is a separate import function to handle the many differnet formats
loadNETG1G2 = function(){
meth_dir = "/Bioinformatics/projects/030_NETG1G2_EPIC"
meta_meth = readRDS(file.path(meth_dir,
"meta_data/220610_NETG1G2_EPIC_normalization_metaData.Rds")) %>%
filter(!is.na(CPI455_GR_AOC_median_sens)) %>%
dplyr::select(Sample_Name,
CPI455_GR_AOC_median_sens,
DAXX_ATRX,
MEN1)
kdm5a = read.xlsx(file.path(meth_dir,
"220719_NETG1G2_EPIC_CNV_details.xlsx"),
sheet = 2) %>%
dplyr::select(Sample_Name,
avg_CNV_KDM5A = average_CNV,
bin_1_KDM5A = `bin_chr12-0005`)
meta_meth = left_join(meta_meth,
kdm5a,
by = "Sample_Name")
kdm5b = read.xlsx(file.path(meth_dir,
"220719_NETG1G2_EPIC_CNV_details.xlsx"),
sheet = 3) %>%
dplyr::select(Sample_Name,
avg_CNV_KDM5B = average_CNV,
bin_1_KDM5B = `bin_chr1-1207`,
bin_2_KDM5B = `bin_chr1-1208`)
meta_meth = left_join(meta_meth,
kdm5b,
by = "Sample_Name")
meta_rna = read_delim(file.path(expression_dir,
"NETG1G2_tumoroids",
"NETG1G2_tumoroids_meta_data.csv"),
delim = ",",
show_col_types = F) %>%
dplyr::select(Sample_Name = aPmP_number_ext,
Sample_ID,
Sample_Type,
CPI455_GR_AOC_median_sens)
rna_counts = read_delim(file.path(expression_dir,
"NETG1G2_tumoroids",
"NETG1G2_tumoroids_GRCh38_Ensembl104_STAR_TMM.csv"),
delim = ",",
show_col_types = F) %>%
filter(gene_name %in% c("KDM5A", "KDM5B", "KDM5C", "KDMD")) %>%
as.data.frame()
rownames(rna_counts) = rna_counts$gene_name
rna_counts = t(rna_counts[, grepl("sapril", colnames(rna_counts))])
rna_counts = rna_counts %>%
as.data.frame() %>%
rownames_to_column("Sample_ID")
meta_rna = left_join(meta_rna,
rna_counts,
by = "Sample_ID")
left_join(meta_meth,
meta_rna,
by = "Sample_Name",
multiple = "all",
suffix = c("", "_dup")) %>%
mutate(Sample_Type = ifelse(is.na(Sample_Type), "tumor_tissue", Sample_Type))
}
NETG1G2 = loadNETG1G2()
cowplot::plot_grid(
NETG1G2 %>%
filter(Sample_Type != "tumoroid") %>%
arrange(!is.na(KDM5A)) %>%
ggplot(aes(x = CPI455_GR_AOC_median_sens,
y = avg_CNV_KDM5A,
fill = CPI455_GR_AOC_median_sens)) +
geom_boxplot(width = 0.5,
alpha = 0.75,
show.legend = F) +
geom_point(size = 2,
show.legend = F) +
geom_text_repel(aes(label = Sample_Name,
fill = NULL)) +
scale_fill_brewer(palette = "Set1") +
theme_classic() +
labs(title = "copy number variation",
y = "average CNV at locus",
x = "median CPI-455 sensitivity"),
NETG1G2 %>%
filter(Sample_Type != "tumoroid" & !is.na(KDM5A)) %>%
ggplot(aes(x = CPI455_GR_AOC_median_sens,
y = KDM5A,
fill = CPI455_GR_AOC_median_sens)) +
geom_boxplot(width = 0.5,
alpha = 0.75,
show.legend = F) +
geom_point(size = 2,
show.legend = F) +
geom_text_repel(aes(label = Sample_Name,
fill = NULL)) +
scale_fill_brewer(palette = "Set1") +
theme_classic() +
labs(title = "gene expression",
y = "log2 TMM gene expression",
x = "median CPI-455 sensitivity"),
NETG1G2 %>%
filter(!is.na(KDM5A)) %>%
ggplot(aes(x = avg_CNV_KDM5A,
y = KDM5A,
color = Sample_Name,
shape = Sample_Type)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression"),
ncol = 3,
rel_widths = c(2,2,3)
)
cowplot::plot_grid(
NETG1G2 %>%
filter(Sample_Type != "tumoroid") %>%
arrange(!is.na(KDM5B)) %>%
ggplot(aes(x = CPI455_GR_AOC_median_sens,
y = avg_CNV_KDM5B,
fill = CPI455_GR_AOC_median_sens)) +
geom_boxplot(width = 0.5,
alpha = 0.75,
show.legend = F) +
geom_point(size = 2,
show.legend = F) +
geom_text_repel(aes(label = Sample_Name,
fill = NULL)) +
scale_fill_brewer(palette = "Set1") +
theme_classic() +
labs(title = "copy number variation",
y = "average CNV at locus",
x = "median CPI-455 sensitivity"),
NETG1G2 %>%
filter(Sample_Type != "tumoroid" & !is.na(KDM5B)) %>%
ggplot(aes(x = CPI455_GR_AOC_median_sens,
y = KDM5B,
fill = CPI455_GR_AOC_median_sens)) +
geom_boxplot(width = 0.5,
alpha = 0.75,
show.legend = F) +
geom_point(size = 2,
show.legend = F) +
geom_text_repel(aes(label = Sample_Name,
fill = NULL)) +
scale_fill_brewer(palette = "Set1") +
theme_classic() +
labs(title = "gene expression",
y = "log2 TMM gene expression",
x = "median CPI-455 sensitivity"),
NETG1G2 %>%
filter(!is.na(KDM5B)) %>%
ggplot(aes(x = avg_CNV_KDM5B,
y = KDM5B,
color = Sample_Name,
shape = Sample_Type)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression"),
ncol = 3,
rel_widths = c(2,2,3)
)
The CNV plots show an enrichment of low CNVs in the good responders but variability is very high. Additionally, several of the CNV plots show ambiguous signal (i.e. equal numbers of gains and losses make it hard to estimate the true baseline).
In PanNETs 3p and 11q are frequently lost but never amplified. This can be used to identify samples with problematic CNV baseline.
knitr::include_graphics(file.path("/Bioinformatics/projects/030_NETG1G2_EPIC",
"processed_data/CNA_plots/Illumina_normalized/full_genome",
paste0(c("aP487p", "aP321m", "aP476m", "mP041p"),
"_CNA.png")))
knitr::include_graphics(file.path("/Bioinformatics/projects/030_NETG1G2_EPIC",
"processed_data/CNA_plots/Illumina_normalized/full_genome",
paste0(c("mP078p", "mP058p", "mP044p", "mP072p", "mP029p"),
"_CNA.png")))
There is a correlation between CNVs and response to KDM5 inhibition via CPI-455. At the KDM5A locus low response is enriched in amplifications. Over the whole genome low response is liked to a higher number of aberrant chromosomes.
Based on the experience with the samples from Di Domenico 2020 it is expected that baseline correction in most cases results in a reduction of CNV values. It is hard to predict how this might influence the split across CPI-455 sensitivity.
The link between gene expression and CNV is low. This may be a hint that CNV estimates are not sufficiently precise without baseline correction.
044_Cell_lines_hypoxiaThe gene expression was quantified using very different approaches
(QGP1 / BON1: HISAT, NT3 / NT18: STAR).
Therefore the expression values are not directly comparable and the data
sets are not visualized together.
# The cell lines are loaded together because the CNV information is joined
loadQBNT = function(){
meth_dir = "/Bioinformatics/projects/044_Cell_lines_hypoxia"
meta_meth = readRDS(file.path(meth_dir,
"meta_data/221212_CEL_metaData.Rds")) %>%
dplyr::select(Sample_Name,
Cell_Line,
Treatment)
# If the gene CNV information is missing generate it or load from file
if (!file.exists("processed_data/230614_cell_line_CNV_details.xlsx")){
anno = createAnno("overlap")
CNV_data = readRDS(file.path(meth_dir,
"processed_data/221212_CNV_conumee.Rds"))
CNV_bins = extractCNABins(CNV_data$ilmn)
gene_details = setNames(lapply(detail_regions$name,
getCNVvalues,
CNV_bins = CNV_bins,
anno = anno),
nm = detail_regions$name)
write.xlsx(gene_details,
"processed_data/230614_cell_line_CNV_details.xlsx")
rm(CNV_bins)
gc()
}
kdm5a = read.xlsx("processed_data/230614_cell_line_CNV_details.xlsx",
sheet = 2) %>%
dplyr::select(Sample_Name,
avg_CNV_KDM5A = average_CNV,
bin_1_KDM5A = `bin_chr12-0005`)
kdm5b = read.xlsx("processed_data/230614_cell_line_CNV_details.xlsx",
sheet = 3) %>%
dplyr::select(Sample_Name,
avg_CNV_KDM5B = average_CNV,
bin_1_KDM5B = `bin_chr1-1207`,
bin_2_KDM5B = `bin_chr1-1208`)
meta_meth = left_join(meta_meth,
kdm5a,
by = "Sample_Name")
meta_meth = left_join(meta_meth,
kdm5b,
by = "Sample_Name")
# The NT3 / NT18 samples names need to be matched
meta_meth = meta_meth %>%
mutate(Sample_Name = ifelse(grepl("^NT", Sample_Name),
gsub("([0-9])_([NH])", "\\1\\2", Sample_Name),
Sample_Name))
NT_counts = read_delim(file.path(expression_dir,
"Cell_lines_hypoxia",
"Cell_lines_hypoxia_GRCh38_Ensembl104_STAR_TMM.csv"),
delim = ",",
show_col_types = F) %>%
filter(gene_name %in% c("KDM5A", "KDM5B", "KDM5C", "KDM5B")) %>%
as.data.frame()
rownames(NT_counts) = NT_counts$gene_name
NT_counts = t(NT_counts[, grepl("^NT", colnames(NT_counts))])
NT_counts = NT_counts %>%
as.data.frame() %>%
rownames_to_column("Sample_Name")
QB_counts = read_delim(file.path(expression_dir,
"Cell_lines_DAXX_ATRX",
"Cell_lines_DAXX_ATRX_GRCh38_Ensembl94_HISAT_TMM.csv"),
delim = ",",
show_col_types = F) %>%
filter(gene_name %in% c("KDM5A", "KDM5B")) %>%
as.data.frame()
rownames(QB_counts) = QB_counts$gene_name
QB_counts = t(QB_counts[, grepl("^[BQ]", colnames(QB_counts))])
# The naming of the samples is so messed up that it is easier to fix manually
QB_counts = QB_counts %>%
as.data.frame() %>%
mutate(Sample_Name = c("BON1_DAXX_Cl16",
"BON1_DAXX_Cl45_C4",
"BON1_DAXX_Cl4_C11",
"BON1_ATRX_Cl23",
"BON1_ATRX_Cl2",
"BON1_ATRX_Cl4",
"BON1_Par_1",
"BON1_Par_2",
"BON1_Par_3",
"QGP1_ATRX_Cl12",
"QGP1_ATRX_Cl18",
"QGP1_ATRX_Cl24",
"QGP1_Par_1",
"QGP1_Par_2",
"QGP1_Par_3"))
rna_counts = bind_rows(NT_counts,
QB_counts)
left_join(meta_meth,
rna_counts,
by = "Sample_Name",
suffix = c("", "_dup"))
}
QBNT = loadQBNT()
cowplot::plot_grid(
QBNT %>%
filter(grepl("^[QB]", Sample_Name)) %>%
ggplot(aes(x = avg_CNV_KDM5A,
y = KDM5A,
color = Cell_Line,
shape = Treatment)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression"),
QBNT %>%
filter(grepl("^[QB]", Sample_Name)) %>%
ggplot(aes(x = avg_CNV_KDM5A,
y = KDM5A,
color = Treatment,
shape = Cell_Line)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression")
)
cowplot::plot_grid(
QBNT %>%
filter(grepl("^[QB]", Sample_Name)) %>%
ggplot(aes(x = avg_CNV_KDM5B,
y = KDM5B,
color = Cell_Line,
shape = Treatment)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression"),
QBNT %>%
filter(grepl("^[QB]", Sample_Name)) %>%
ggplot(aes(x = avg_CNV_KDM5B,
y = KDM5B,
color = Treatment,
shape = Cell_Line)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression")
)
cowplot::plot_grid(
QBNT %>%
filter(grepl("^NT", Sample_Name)) %>%
ggplot(aes(x = avg_CNV_KDM5A,
y = KDM5A,
color = Cell_Line,
shape = Treatment)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression"),
QBNT %>%
filter(grepl("^NT", Sample_Name)) %>%
ggplot(aes(x = avg_CNV_KDM5A,
y = KDM5A,
color = Treatment,
shape = Cell_Line)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression")
)
cowplot::plot_grid(
QBNT %>%
filter(grepl("^NT", Sample_Name)) %>%
ggplot(aes(x = avg_CNV_KDM5B,
y = KDM5B,
color = Cell_Line,
shape = Treatment)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression"),
QBNT %>%
filter(grepl("NT", Sample_Name)) %>%
ggplot(aes(x = avg_CNV_KDM5B,
y = KDM5B,
color = Treatment,
shape = Cell_Line)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression")
)
The cell lines confirm that the link between copy number and gene expression of KDM5A / KDM5B is quite weak. The correlation varies between tumors and the different cell lines making it hard to predict gene expression from CNV.
All cell lines have low CNV and based on this are expected to respond well to CPI-455 inhibition. Due to experimental differences it is not trivial to compare the gene expression values to predict drug response. Based on relative gene expression QGP1 are expected to be slightly more sensitive than BON1 and NT18LM more sensitive compared to NT3.
For further analysis of the cell lines more meta data such as the DAXX / ATRX and MEN1 mutation status and ideally a measure of CPI-455 response will be beneficial.
5.NDD\DNAme\Analysis\CNA\CNA_corr_Meth_And_LOH_ICGC_UCL_UB/df_Plot_Bins_AllSamples_noNA.RdaThe expression data is analyzed using the same toolchain
(STAR + RSEM) and may be compared directly.
CNV data is available for 81 samples. Gene expression data is
available for 42 samples. For 19 both data types exist.
As mentioned above CNV values were corrected and are not mixed with
uncorrected estimates for the missing samples.
# Gene expression and methylation data are only available for some of the samples
# The union of samples is loaded
loadNDD = function(){
meta_meth = read_delim(file.path("/Bioinformatics/projects/shared_metadata/PanNET_methylation",
"221004_PanNET_methylation_annotation_DiDomenico_full.txt"),
delim = "\t",
show_col_types = F) %>%
dplyr::select(Sample_Name,
MEN1,
DAXX_ATRX,
CC_Epi_newLRO)
if (!file.exists("processed_data/230614_NDD_CNV_details.xlsx")){
# The chromosome arm data from Nunzia was found under 5.NDD\DNAme\Analysis\CNA\CNA_corr_Meth_And_LOH_ICGC_UCL_UB/df_Plot_Bins_AllSamples_noNA
# This plot data contains the corrected bins for most samples (UB06 is missing for some unspecified reason)
# There likely is a reason for this in the previous analysis steps
# Some bins were removed on accident if the End is the same as the chromosome arm border
# This should not be problematic due to the low number and location close to the centrosomes
# The legend of the heatmap is misleading because the cutoffs are not -1 and 1 but rather -0.2 and 0.2
load(file.path("/Bioinformatics/projects/045_Aurel_one_PanNET",
"processed_data/df_Plot_Bins_NoNA_AllSamples.RDa"))
anno = createAnno("450k")
bin_data = anno@bins %>%
as.data.frame() %>%
rownames_to_column("Feature") %>%
mutate(BinName = paste0(seqnames, "_", start, "_", end))
CNV_bins = left_join(df_Plot_Bins_AllSamples_noNA,
bin_data %>%
dplyr::select(Feature,
BinName),
by = "BinName") %>%
dplyr::select(-c(BinName,
Chromosome,
ChrArm.x)) %>%
pivot_longer(names_to = "Sample_Name",
values_to = "CNV",
-c(Feature,
Start,
End))
gene_details = setNames(lapply(detail_regions$name,
getCNVvalues,
CNV_bins = CNV_bins,
anno = anno),
nm = detail_regions$name)
write.xlsx(gene_details,
"processed_data/230614_NDD_CNV_details.xlsx")
}
kdm5a = read.xlsx("processed_data/230614_NDD_CNV_details.xlsx",
sheet = 2) %>%
dplyr::select(Sample_Name,
avg_CNV_KDM5A = average_CNV,
bin_1_KDM5A = `bin_chr12-0005`)
kdm5b = read.xlsx("processed_data/230614_NDD_CNV_details.xlsx",
sheet = 3) %>%
dplyr::select(Sample_Name,
avg_CNV_KDM5B = average_CNV,
bin_1_KDM5B = `bin_chr1-1269`,
bin_2_KDM5B = `bin_chr1-1270`)
meta_meth = left_join(meta_meth,
kdm5a,
by = "Sample_Name")
meta_meth = left_join(meta_meth,
kdm5b,
by = "Sample_Name")
# all samples matched?
Scarpa_counts = read_delim(file.path(expression_dir,
"Scarpa_2017",
"Scarpa_2017_GRCh37_Ensembl75_RSEM_TMM.csv"),
delim = ",",
show_col_types = F) %>%
filter(gene_name %in% c("KDM5A", "KDM5B", "KDM5C", "KDMD")) %>%
as.data.frame()
rownames(Scarpa_counts) = Scarpa_counts$gene_name
Scarpa_counts = t(Scarpa_counts[, grepl("^ICGC", colnames(Scarpa_counts))])
Scarpa_counts = Scarpa_counts %>%
as.data.frame() %>%
rownames_to_column("Sample_Name")
Chan_counts = read_delim(file.path(expression_dir,
"Chan_2018",
"Chan_2018_GRCh37_Ensembl75_RSEM_TMM.csv"),
delim = ",",
show_col_types = F) %>%
filter(gene_name %in% c("KDM5A", "KDM5B", "KDM5C", "KDMD")) %>%
as.data.frame()
rownames(Chan_counts) = Chan_counts$gene_name
Chan_counts = t(Chan_counts[, grepl("[0-9]$", colnames(Chan_counts))])
# The naming of the samples is so messed up that it is easier to fix manually
Chan_counts = Chan_counts %>%
as.data.frame() %>%
rownames_to_column("Sample_Name")
rna_counts = bind_rows(Scarpa_counts,
Chan_counts)
left_join(meta_meth,
rna_counts,
by = "Sample_Name",
suffix = c("", "_dup"))
}
NDD = loadNDD()
cowplot::plot_grid(
NDD %>%
filter(!is.na(NDD$KDM5A) & !is.na(NDD$avg_CNV_KDM5A)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan")) %>%
ggplot(aes(x = avg_CNV_KDM5A,
y = KDM5A,
color = Cohort)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression"),
NDD %>%
filter(!is.na(NDD$KDM5A) & !is.na(NDD$avg_CNV_KDM5A)) %>%
ggplot(aes(x = avg_CNV_KDM5A,
y = KDM5A,
color = CC_Epi_newLRO)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression"),
NDD %>%
filter(!is.na(NDD$KDM5A) & !is.na(NDD$avg_CNV_KDM5A)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan")) %>%
ggplot(aes(x = avg_CNV_KDM5A,
y = KDM5A,
color = DAXX_ATRX)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression"),
NDD %>%
filter(!is.na(NDD$KDM5A) & !is.na(NDD$avg_CNV_KDM5A)) %>%
ggplot(aes(x = avg_CNV_KDM5A,
y = KDM5A,
color = MEN1)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression"),
align = "v"
)
Intermediate_ADM shows very different KDM5 expressioncowplot::plot_grid(
NDD %>%
filter(!is.na(NDD$KDM5B) & !is.na(NDD$avg_CNV_KDM5B)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan")) %>%
ggplot(aes(x = avg_CNV_KDM5B,
y = KDM5B,
color = Cohort)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression"),
NDD %>%
filter(!is.na(NDD$KDM5B) & !is.na(NDD$avg_CNV_KDM5B)) %>%
ggplot(aes(x = avg_CNV_KDM5B,
y = KDM5B,
color = CC_Epi_newLRO)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression"),
NDD %>%
filter(!is.na(NDD$KDM5B) & !is.na(NDD$avg_CNV_KDM5B)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan")) %>%
ggplot(aes(x = avg_CNV_KDM5B,
y = KDM5B,
color = DAXX_ATRX)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression"),
NDD %>%
filter(!is.na(NDD$KDM5B) & !is.na(NDD$avg_CNV_KDM5B)) %>%
ggplot(aes(x = avg_CNV_KDM5B,
y = KDM5B,
color = MEN1)) +
geom_point(size = 3) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
labs(title = "CNV vs gene expression",
x = "average CNV at locus",
y = "log2 TMM gene expression"),
align = "v"
)
Intermediate_ADM shows very different KDM5 expressionAll samples have low CNV and based on this are expected to respond well to CPI-455 inhibition. When predicting CPI-455 response from gene expression two clusters with different expected sensitivity can be constructed.
cowplot::plot_grid(
NDD %>%
filter(!is.na(NDD$avg_CNV_KDM5A)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(avg_CNV_KDM5A)])) %>%
ggplot(aes(x = Sample_Name,
y = avg_CNV_KDM5A,
color = Cohort)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "CNV variability",
y = "average CNV at locus"),
NDD %>%
filter(!is.na(NDD$avg_CNV_KDM5A)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(avg_CNV_KDM5A)])) %>%
ggplot(aes(x = Sample_Name,
y = avg_CNV_KDM5A,
color = CC_Epi_newLRO)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "CNV variability",
y = "average CNV at locus"),
NDD %>%
filter(!is.na(NDD$avg_CNV_KDM5A)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(avg_CNV_KDM5A)])) %>%
ggplot(aes(x = Sample_Name,
y = avg_CNV_KDM5A,
color = DAXX_ATRX)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1", na.value = "grey50") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "CNV variability",
y = "average CNV at locus"),
NDD %>%
filter(!is.na(NDD$avg_CNV_KDM5A)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(avg_CNV_KDM5A)])) %>%
ggplot(aes(x = Sample_Name,
y = avg_CNV_KDM5A,
color = MEN1)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "CNV variability",
y = "average CNV at locus"),
ncol = 1,
align = "v"
)
The separation of the cohorts may indicate a technical effect driving CNV estimates
cowplot::plot_grid(
NDD %>%
filter(!is.na(NDD$avg_CNV_KDM5B)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(avg_CNV_KDM5B)])) %>%
ggplot(aes(x = Sample_Name,
y = avg_CNV_KDM5B,
color = Cohort)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "CNV variability",
y = "average CNV at locus"),
NDD %>%
filter(!is.na(NDD$avg_CNV_KDM5B)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(avg_CNV_KDM5B)])) %>%
ggplot(aes(x = Sample_Name,
y = avg_CNV_KDM5B,
color = CC_Epi_newLRO)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "CNV variability",
y = "average CNV at locus"),
NDD %>%
filter(!is.na(NDD$avg_CNV_KDM5B)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(avg_CNV_KDM5B)])) %>%
ggplot(aes(x = Sample_Name,
y = avg_CNV_KDM5B,
color = DAXX_ATRX)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1", na.value = "grey50") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "CNV variability",
y = "average CNV at locus"),
NDD %>%
filter(!is.na(NDD$avg_CNV_KDM5B)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(avg_CNV_KDM5B)])) %>%
ggplot(aes(x = Sample_Name,
y = avg_CNV_KDM5B,
color = MEN1)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "CNV variability",
y = "average CNV at locus"),
ncol = 1,
align = "v"
)
KDM5A and KDM5B display an inverted pattern.
cowplot::plot_grid(
NDD %>%
filter(!is.na(NDD$KDM5A)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(KDM5A)])) %>%
ggplot(aes(x = Sample_Name,
y = KDM5A,
color = Cohort)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "gene expression variability",
y = "log2 TMM gene expression"),
NDD %>%
filter(!is.na(NDD$KDM5A)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(KDM5A)])) %>%
ggplot(aes(x = Sample_Name,
y = KDM5A,
color = CC_Epi_newLRO)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "gene expression variability",
y = "log2 TMM gene expression"),
NDD %>%
filter(!is.na(NDD$KDM5A)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(KDM5A)])) %>%
ggplot(aes(x = Sample_Name,
y = KDM5A,
color = DAXX_ATRX)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1", na.value = "grey50") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "gene expression variability",
y = "log2 TMM gene expression"),
NDD %>%
filter(!is.na(NDD$KDM5A)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(KDM5A)])) %>%
ggplot(aes(x = Sample_Name,
y = KDM5A,
color = MEN1)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "gene expression variability",
y = "log2 TMM gene expression"),
ncol = 1,
align = "v"
)
cowplot::plot_grid(
NDD %>%
filter(!is.na(NDD$KDM5B)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(KDM5B)])) %>%
ggplot(aes(x = Sample_Name,
y = KDM5B,
color = Cohort)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "gene expression variability",
y = "log2 TMM gene expression"),
NDD %>%
filter(!is.na(NDD$KDM5B)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(KDM5B)])) %>%
ggplot(aes(x = Sample_Name,
y = KDM5B,
color = CC_Epi_newLRO)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "gene expression variability",
y = "log2 TMM gene expression"),
NDD %>%
filter(!is.na(NDD$KDM5B)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(KDM5B)])) %>%
ggplot(aes(x = Sample_Name,
y = KDM5B,
color = DAXX_ATRX)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1", na.value = "grey50") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "gene expression variability",
y = "log2 TMM gene expression"),
NDD %>%
filter(!is.na(NDD$KDM5B)) %>%
mutate(Cohort= ifelse(grepl("ICGC", Sample_Name), "Scarpa", "Chan"),
Sample_Name = factor(Sample_Name,
levels = Sample_Name[order(KDM5B)])) %>%
ggplot(aes(x = Sample_Name,
y = KDM5B,
color = MEN1)) +
geom_point(size = 2) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 5)) +
labs(title = "gene expression variability",
y = "log2 TMM gene expression"),
ncol = 1,
align = "v"
)
In contrast to the NETG1G2 samples all CNVs are rather low. The FISH based correction always resulted in an decrease of the CNVs calculated from DNA methylation data. Therefore the NETG1G2 values may be overestimates.
Due to the high spread in CNV and especially KDM5A / KDM5B expression it would be expected to observe difference in response to CPI-455 in these cases.
Many patterns are strongly correlated with the difference between the Scarpa and Chan data.
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] conumee_1.32.0
## [2] IlluminaHumanMethylationEPICmanifest_0.3.0
## [3] IlluminaHumanMethylationEPICanno.ilm10b2.hg19_0.6.0
## [4] IlluminaHumanMethylation450kmanifest_0.4.0
## [5] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
## [6] minfi_1.44.0
## [7] bumphunter_1.40.0
## [8] locfit_1.5-9.7
## [9] iterators_1.0.14
## [10] foreach_1.5.2
## [11] Biostrings_2.66.0
## [12] XVector_0.38.0
## [13] SummarizedExperiment_1.28.0
## [14] Biobase_2.58.0
## [15] MatrixGenerics_1.10.0
## [16] matrixStats_0.63.0
## [17] GenomicRanges_1.50.2
## [18] GenomeInfoDb_1.34.9
## [19] IRanges_2.32.0
## [20] S4Vectors_0.36.2
## [21] BiocGenerics_0.44.0
## [22] ggrepel_0.9.3
## [23] openxlsx_4.2.5.2
## [24] lubridate_1.9.2
## [25] forcats_1.0.0
## [26] stringr_1.5.0
## [27] dplyr_1.1.0
## [28] purrr_1.0.1
## [29] readr_2.1.4
## [30] tidyr_1.3.0
## [31] tibble_3.2.0
## [32] ggplot2_3.4.1
## [33] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] BiocFileCache_2.6.1 plyr_1.8.8 splines_4.2.2
## [4] BiocParallel_1.32.6 digest_0.6.31 htmltools_0.5.4
## [7] fansi_1.0.4 magrittr_2.0.3 memoise_2.0.1
## [10] tzdb_0.3.0 limma_3.54.2 annotate_1.76.0
## [13] vroom_1.6.1 askpass_1.1 timechange_0.2.0
## [16] siggenes_1.72.0 prettyunits_1.1.1 colorspace_2.1-0
## [19] blob_1.2.3 rappdirs_0.3.3 xfun_0.37
## [22] jsonlite_1.8.4 crayon_1.5.2 RCurl_1.98-1.10
## [25] genefilter_1.80.3 GEOquery_2.66.0 survival_3.4-0
## [28] glue_1.6.2 gtable_0.3.1 zlibbioc_1.44.0
## [31] DelayedArray_0.24.0 Rhdf5lib_1.20.0 HDF5Array_1.26.0
## [34] scales_1.2.1 DBI_1.1.3 rngtools_1.5.2
## [37] Rcpp_1.0.10 xtable_1.8-4 progress_1.2.2
## [40] bit_4.0.5 mclust_6.0.0 preprocessCore_1.60.2
## [43] httr_1.4.5 RColorBrewer_1.1-3 ellipsis_0.3.2
## [46] farver_2.1.1 pkgconfig_2.0.3 reshape_0.8.9
## [49] XML_3.99-0.13 sass_0.4.5 dbplyr_2.3.1
## [52] DNAcopy_1.72.3 utf8_1.2.3 labeling_0.4.2
## [55] tidyselect_1.2.0 rlang_1.0.6 AnnotationDbi_1.60.2
## [58] munsell_0.5.0 tools_4.2.2 cachem_1.0.7
## [61] cli_3.6.0 generics_0.1.3 RSQLite_2.3.0
## [64] evaluate_0.20 fastmap_1.1.1 yaml_2.3.7
## [67] knitr_1.42 bit64_4.0.5 zip_2.2.2
## [70] beanplot_1.3.1 scrime_1.3.5 KEGGREST_1.38.0
## [73] nlme_3.1-160 doRNG_1.8.6 sparseMatrixStats_1.10.0
## [76] nor1mix_1.3-0 xml2_1.3.3 biomaRt_2.54.1
## [79] compiler_4.2.2 rstudioapi_0.14 filelock_1.0.2
## [82] curl_5.0.0 png_0.1-8 bslib_0.4.2
## [85] stringi_1.7.12 highr_0.10 GenomicFeatures_1.50.4
## [88] lattice_0.20-45 Matrix_1.5-1 multtest_2.54.0
## [91] vctrs_0.5.2 pillar_1.8.1 lifecycle_1.0.3
## [94] rhdf5filters_1.10.1 jquerylib_0.1.4 data.table_1.14.8
## [97] cowplot_1.1.1 bitops_1.0-7 rtracklayer_1.58.0
## [100] R6_2.5.1 BiocIO_1.8.0 codetools_0.2-18
## [103] MASS_7.3-58.1 rhdf5_2.42.1 openssl_2.0.6
## [106] rjson_0.2.21 withr_2.5.0 GenomicAlignments_1.34.1
## [109] Rsamtools_2.14.0 GenomeInfoDbData_1.2.9 hms_1.1.2
## [112] quadprog_1.5-8 grid_4.2.2 base64_2.0.1
## [115] rmarkdown_2.20 DelayedMatrixStats_1.20.0 illuminaio_0.40.0
## [118] restfulr_0.0.15